

Introduction to the Usage of Open Data from the Large Hadron Collider for Computer Scientists in the Context of Machine Learning

Saala, Timo, Schott, Matthias

arXiv.org Artificial Intelligence

Deep learning techniques have evolved rapidly in recent years, significantly impacting various scientific fields, including experimental particle physics. To effectively leverage the latest developments in computer science for particle physics, a strengthened collaboration between computer scientists and physicists is essential. As all machine learning techniques depend on the availability and comprehensibility of extensive data, clear data descriptions and commonly used data formats are prerequisites for successful collaboration. In this study, we converted open data from the Large Hadron Collider, recorded in the ROOT data format commonly used in high-energy physics, to pandas DataFrames, a well-known format in computer science. Additionally, we provide a brief introduction to the data's content and interpretation. This paper aims to serve as a starting point for future interdisciplinary collaborations between computer scientists and physicists, fostering closer ties and facilitating efficient knowledge exchange.


Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code

Galimzyanov, Timur, Titov, Sergey, Golubev, Yaroslav, Bogomolov, Egor

arXiv.org Artificial Intelligence

This paper introduces the human-curated PandasPlotBench dataset, designed to evaluate language models' effectiveness as assistants in visual data exploration. Our benchmark focuses on generating code for visualizing tabular data - such as a Pandas DataFrame - based on natural language instructions, complementing current evaluation tools and expanding their scope. The dataset includes 175 unique tasks. Our experiments assess several leading Large Language Models (LLMs) across three visualization libraries: Matplotlib, Seaborn, and Plotly. We show that shortening the tasks has a minimal effect on plotting capabilities, allowing for user interfaces that accommodate concise user input without sacrificing functionality or accuracy. We also find that while LLMs perform well with popular libraries like Matplotlib and Seaborn, challenges persist with Plotly, highlighting areas for improvement. We hope that the modular design of our benchmark will broaden current studies on generating visualizations. Our benchmark is available online: https://huggingface.co/datasets/JetBrains-Research/plot_bench. The code for running the benchmark is also available: https://github.com/JetBrains-Research/PandasPlotBench.


Exploring the Potential of AI-Generated Synthetic Datasets: A Case Study on Telematics Data with ChatGPT

Lingo, Ryan

arXiv.org Artificial Intelligence

This research delves into the construction and utilization of synthetic datasets, specifically within the telematics sphere, leveraging OpenAI's powerful language model, ChatGPT. Synthetic datasets present an effective solution to challenges pertaining to data privacy, scarcity, and control over variables - characteristics that make them particularly valuable for research pursuits. The utility of these datasets, however, largely depends on their quality, measured through the lenses of diversity, relevance, and coherence. To illustrate this data creation process, a hands-on case study is conducted, focusing on the generation of a synthetic telematics dataset. The experiment involved an iterative guidance of ChatGPT, progressively refining prompts and culminating in the creation of a comprehensive dataset for a hypothetical urban planning scenario in Columbus, Ohio. Upon generation, the synthetic dataset was subjected to an evaluation, focusing on the previously identified quality parameters and employing descriptive statistics and visualization techniques for a thorough analysis. Despite synthetic datasets not serving as perfect replacements for actual world data, their potential in specific use-cases, when executed with precision, is significant. This research underscores the potential of AI models like ChatGPT in enhancing data availability for complex sectors like telematics, thus paving the way for a myriad of new research opportunities.


Calculate Variance in Pandas DataFrame

#artificialintelligence

Pandas is a Python library that is widely used for data analysis and machine learning tasks. It is open source, powerful, fast, and easy to use. When working with big data, we need to analyze, manipulate, and update it, and the pandas library plays a leading role there. Sometimes we need to calculate the variance in a Pandas DataFrame. Variance is a statistical measure of dispersion that quantifies the spread of the data points in a data set.
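As a minimal sketch of the computation the article describes (using hypothetical exam-score data), variance can be taken over whole DataFrames or single columns, with the degrees of freedom controlled by ddof:

```python
import pandas as pd

# Hypothetical sample data: exam scores for two subjects
df = pd.DataFrame({
    "math":    [88, 92, 75, 60, 81],
    "physics": [70, 85, 90, 65, 78],
})

# Column-wise sample variance (ddof=1 is pandas' default)
col_var = df.var()

# Population variance instead (divide by N rather than N-1)
pop_var = df.var(ddof=0)

# Variance of a single column
math_var = df["math"].var()
print(col_var)
```

Note that pandas defaults to the sample variance (ddof=1), while NumPy's np.var defaults to the population variance (ddof=0), a common source of off-by-one surprises when comparing the two.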


Solving Spotify Multiclass Genre Classification Problem

#artificialintelligence

The music industry has grown more popular, and how people listen to music is changing rapidly. The rise of music streaming services has increased the demand for automatic music categorization and recommendation systems. Spotify, one of the world's leading music streaming platforms, has millions of subscribers and a massive song catalog. Yet, to give customers a personalized music experience, Spotify must recommend tracks that fit their preferences. Spotify uses machine learning algorithms to categorize music by genre and guide recommendations.


Is a Small Dataset Risky? Some reflections and tests on the use…

#artificialintelligence

Recently I wrote an article about the risks of using the train_test_split() function provided by the scikit-learn Python package. That article drew a lot of comments, some positive and others raising concerns. My thesis was: be careful when you use the train_test_split() function, because different seeds may produce very different models. The main concern raised was that train_test_split() does not behave strangely; the problem is that I used a small dataset to demonstrate my thesis. In this article, I investigate how the performance of a Linear Regression model varies with dataset size.
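The seed sensitivity the article debates can be sketched in a few lines. This is a hypothetical illustration, not the article's own experiment: a small noisy dataset is split with several random_state values and the held-out R^2 is compared across seeds:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical small dataset: 20 points with a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 1))
y = 3 * X.ravel() + rng.normal(scale=2.0, size=20)

# Fit the same model under different train/test seeds and compare scores
scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))  # R^2 on the 5 held-out points

# With only 20 samples, the R^2 can vary noticeably across seeds
print([round(s, 3) for s in scores])
```

With only five test points per split, each seed evaluates the model on an essentially different sample, which is exactly why small datasets amplify the seed effect.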


Fake It Till You Make It: Generating Realistic Synthetic Customer Datasets - KDnuggets

#artificialintelligence

Being able to create and use synthetic data in projects has become a must-have skill for data scientists. I have written in the past about using the Python library Faker for creating your own synthetic datasets. Instead of repeating anything in that article, let's treat this as the second in a series of generating synthetic data for your own data science projects. This time around, let's generate some fake customer order data. If you don't know anything about Faker, how it is used, or what you can do with it, I suggest that you check out the previous article first.
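A dependency-free sketch of the idea, using only the Python standard library in place of Faker's providers (all names, products, and prices below are made up for illustration):

```python
import random
from datetime import datetime, timedelta

random.seed(42)  # reproducible fake data

# Hypothetical lookup tables standing in for Faker's providers
CUSTOMERS = ["Ada Price", "Ben Okafor", "Carla Mendes", "Dev Patel"]
PRODUCTS = {"widget": 9.99, "gadget": 24.50, "gizmo": 4.25}

def fake_order(order_id):
    """Build one synthetic customer-order record as a dict."""
    product = random.choice(list(PRODUCTS))
    quantity = random.randint(1, 5)
    ordered_at = datetime(2023, 1, 1) + timedelta(days=random.randint(0, 364))
    return {
        "order_id": order_id,
        "customer": random.choice(CUSTOMERS),
        "product": product,
        "quantity": quantity,
        "total": round(quantity * PRODUCTS[product], 2),
        "ordered_at": ordered_at.date().isoformat(),
    }

orders = [fake_order(i) for i in range(1, 101)]
print(orders[0])
```

Faker does the same thing with far richer providers (locale-aware names, addresses, credit cards), but the structure, which is a seeded generator emitting record dicts, is identical.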


Python Machine Learning Mini-Course

#artificialintelligence

This 14-day mini-course teaches you how to start using Python to build accurate predictive models and confidently complete machine learning projects. Take advantage of my referral link today and become a Medium member. For just $5 a month, you will have access to everything Medium has to offer. If you become a member, I will receive $2 of the $5, which helps me maintain this blog. There is a lot of important information in this post, so bookmark it if you find it useful.


Complete Guide to Pandas DataFrame with real-time use case

#artificialintelligence

Originally published on Towards AI, the world's leading AI and technology news and media company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses. After my PySpark series, where readers were mostly interested in the PySpark DataFrame and PySpark RDD, I received suggestions and requests to write about the Pandas DataFrame, so that readers can compare PySpark and Pandas in terms of syntax rather than resource consumption.


3 Ways to Append Rows to Pandas DataFrames - KDnuggets

#artificialintelligence

In this mini tutorial, we will learn three ways to append rows to a pandas DataFrame, including the most effective and easy ways to add multiple rows at once. We will use pandas DataFrame() with a dictionary as input to create a sample dataframe of students enrolled in an online master's degree. It has five columns and five distinct rows and will serve as the base dataframe.
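The three approaches can be sketched as follows (a smaller hypothetical student dataframe is used here for brevity; the tutorial's own has five columns and five rows):

```python
import pandas as pd

# Base dataframe of students (hypothetical data)
df = pd.DataFrame({
    "name":   ["Ana", "Bo"],
    "course": ["ML", "NLP"],
})

# 1. loc with the next integer label appends a single row in place
df.loc[len(df)] = ["Cy", "CV"]

# 2. pd.concat with a one-row DataFrame
#    (DataFrame.append was deprecated and removed in pandas 2.0)
new_row = pd.DataFrame([{"name": "Dee", "course": "RL"}])
df = pd.concat([df, new_row], ignore_index=True)

# 3. pd.concat with several rows at once - the efficient choice for bulk appends
more = pd.DataFrame([
    {"name": "Ed", "course": "ML"},
    {"name": "Fay", "course": "NLP"},
])
df = pd.concat([df, more], ignore_index=True)
print(df)
```

For many appends in a loop, the idiomatic pattern is to collect rows in a plain Python list and call pd.concat (or the DataFrame constructor) once at the end, since each concat copies the whole frame.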